fix: use positional alignment in contrib functions to prevent cartesian-product OOM #317
Open
drussellmrichie wants to merge 7 commits into larsiusprime:master
Conversation
Thank you for your contribution. 0 out of 3 committers have signed the CLA.
Vectorize building-parcel intersection area calculation in OvertureService
Replace slow row-by-row apply() with vectorized geopandas operations: build a GeoSeries of building geometries aligned with the joined dataframe, then call .intersection() and .area in one pass instead of per-row.
https://claude.ai/code/session_01Dmffw6eWUC7rrE6JD1SXXX
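A minimal sketch of that pattern, assuming the joined frame carries the parcel geometry as its active geometry column and the building geometry in a `building_geom` column (both names are illustrative, not the actual OvertureService code):

```python
import geopandas as gpd

def building_parcel_intersection_area(df_joined: gpd.GeoDataFrame) -> gpd.GeoDataFrame:
    """Vectorized alternative to a per-row apply():
    intersect each parcel geometry with its joined building geometry in one pass.

    Assumes df_joined has the parcel geometry as its active 'geometry' column
    and the building geometry in 'building_geom' (hypothetical column names).
    """
    # Build a GeoSeries of building geometries on the same index as the joined
    # frame, so .intersection() pairs rows one-to-one.
    buildings = gpd.GeoSeries(df_joined["building_geom"], index=df_joined.index,
                              crs=df_joined.crs)

    # One vectorized intersection + area computation instead of df.apply(..., axis=1).
    df_joined["intersection_area"] = df_joined.geometry.intersection(buildings).area
    return df_joined
```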
Implement LYCD (Least You Can Do) land valuation method
Adds calc_lycd_land_values() to land.py, which values land by:
1. Computing a uniform local land rate per model group from the median market value and median lot size of improved properties
2. Applying that rate to every parcel (land_value = rate × lot_size)

This avoids the side-by-side disparity created by applying a flat allocation % to each parcel's individual market value. Supports a user-supplied land_alloc (float or per-group dict) or automatic derivation from vacant vs improved per-unit value ratios. Also fixes missing area_unit import in land.py.
https://claude.ai/code/session_011PMWoWpQazUGuCXKT1nmVg
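A minimal sketch of the rate-then-apply idea, with illustrative column names (`market_value`, `lot_size`, `model_group`, `is_improved`) and a single flat `land_alloc`; the real `calc_lycd_land_values()` in land.py is not reproduced here:

```python
import pandas as pd

def lycd_land_values(df: pd.DataFrame, land_alloc: float = 1.0) -> pd.Series:
    """Sketch of the LYCD idea: one local land rate per model group,
    derived from improved properties, applied uniformly to every parcel.

    Column names are illustrative assumptions, not necessarily land.py's.
    """
    improved = df[df["is_improved"]]

    # Rate per model group: median market value over median lot size,
    # scaled by the land allocation fraction.
    rate = (
        improved.groupby("model_group")["market_value"].median()
        / improved.groupby("model_group")["lot_size"].median()
    ) * land_alloc

    # land_value = rate x lot_size, so neighboring parcels with similar lots
    # get similar land values regardless of their individual improvements.
    return df["model_group"].map(rate) * df["lot_size"]
```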
…diction_to_contribution to prevent cartesian-product OOM
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Force-pushed from dbc2a9a to 73e2f70.
calc_lycd_land_values now accepts a subarea_field parameter (e.g. "neighborhood", "census_tract") to derive a separate local land rate per (subarea, model_group) cell instead of one rate per model group across the entire jurisdiction. Cells below min_improved_per_cell fall back to the model-group rate. _derive_lycd_alloc_from_data is updated to match, with group_field, subarea_field, and alloc_by_group_fallback parameters.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
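A hedged sketch of the per-(subarea, model_group) rate with the thin-cell fallback described above; the field names and rate definition reuse the assumptions from the earlier LYCD sketch and are not the actual land.py code:

```python
import pandas as pd

def per_cell_land_rate(improved: pd.DataFrame,
                       group_field: str = "model_group",
                       subarea_field: str = "neighborhood",
                       min_improved_per_cell: int = 5) -> pd.Series:
    """Derive a land rate per (subarea, model_group) cell, falling back to the
    jurisdiction-wide model-group rate for cells with too few improved parcels.
    """
    def _rate(g: pd.DataFrame) -> float:
        # Local land rate: median market value over median lot size.
        return g["market_value"].median() / g["lot_size"].median()

    group_rate = improved.groupby(group_field).apply(_rate)   # per model group
    cells = improved.groupby([subarea_field, group_field])
    cell_rate = cells.apply(_rate)                            # per (subarea, group) cell
    cell_count = cells.size()

    # Thin cells fall back to the model-group rate.
    fallback = pd.Series(
        cell_rate.index.get_level_values(group_field).map(group_rate).to_numpy(),
        index=cell_rate.index,
    )
    return cell_rate.where(cell_count >= min_improved_per_cell, fallback)
```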
## Split-checkpoint contributions strategy (benchmark, pipeline, modeling)

- Thread `do_contributions` flag through the full call chain: `run_models`, `_run_models`, `run_one_model`, `run_one_hedonic_model`, `_predict_one_model`, `_run_hedonic_models`, `_write_model_results`, `write_model_parameters`
- Add `_DS` and `_SMRContribContext` helper classes (module-level, picklable)
- `_write_model_results`: save `_smr_for_contribs.pkl` + `_model_features.json` when `do_contributions=False`; enables deferred SHAP pass checkpointing
- New `compute_model_contributions(settings)` in pipeline.py: loads saved `_smr_for_contribs.pkl` files per model group, runs contributions pass, deletes pkl; allows crash recovery without retraining (see the sketch after this message)
- `finalize_models`: expose `run_main/vacant/hedonic/ensemble`, `do_shaps`, `do_plots`, `do_contributions` as real parameters (were hardcoded True/False)
- `_run_models`: add per-group `ind_vars` override from `settings.modeling.models.<mvh>.group_overrides.<group>.ind_vars`

## Bug fixes

- `data.py`: `from scipy.spatial import cKDTree` (private submodule removed in scipy >= 1.14)
- `data.py`: move `astype(int)` cast to after the `.loc` override in `_basic_geo_enrichment` — pandas 2.x rejects float→int64 via `.loc`
- `sales_scrutiny_study.py`: go through `.astype(object)` before `.astype(str)` to avoid ArrowNotImplementedError on null-dtype columns (pyarrow >= 22)
- `shap_analysis.py`: warn + filter instead of raising ValueError when `list_vars` contains features dropped by variable selection
- `tuning.py`: guard `_lightgbm_rolling_origin_cv` against n_samples < n_splits
- `utilities/cache.py`: replace `col.values == ...` with `col.eq(...)` and wrap `.sum()` in `int()` — ArrowExtensionArray has no `.values` sum
- `utilities/stats.py`: multiple `calc_correlations` guards for sparse/small groups — empty naive_corr early-exit, all-NA score break, `len(remaining) <= 1` early return with correct schema, `.columns.tolist()[0]` for Arrow-backed Index compat

## Misc

- `.gitignore`: add `.DS_Store` and `.pytest_cache/`

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
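The deferred-contributions checkpoint idea, sketched in isolation; the filename pattern, directory handling, and the callable standing in for the actual contributions pass are assumptions, not the real pipeline.py code:

```python
import glob
import os
import pickle
from typing import Any, Callable

def compute_contributions_from_checkpoints(out_dir: str,
                                           run_contributions: Callable[[Any], None]) -> None:
    """Resume-friendly second pass over training checkpoints.

    During training, each model group is assumed to have saved a
    '*_smr_for_contribs.pkl' checkpoint. Load each one, run the expensive
    contributions pass via the supplied callable, then delete the checkpoint,
    so a crash only loses the group in flight and nothing is retrained.
    """
    for pkl_path in sorted(glob.glob(os.path.join(out_dir, "*_smr_for_contribs.pkl"))):
        with open(pkl_path, "rb") as f:
            smr = pickle.load(f)     # saved model-results object for one group

        run_contributions(smr)       # e.g. the SHAP/contributions computation

        os.remove(pkl_path)          # checkpoint consumed: a restart skips this group
```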
## Problem
`_contrib_to_unit_values` and `_add_prediction_to_contribution` both join `df_contrib` back to `df_base`/`df` using `pd.merge(..., on=the_key, how="left")`. This produces a cartesian product and an OOM crash under the following combination of conditions:

- `df_base` has a non-sequential index (e.g. after `sort_values()` — the index retains the pre-sort positions).
- The join key (`"key"` or `"key_sale"`) is backed by PyArrow string dtype, which triggers pandas' slow/non-unique merge path even when the key values are logically unique.

The crash surfaces as a `MemoryError` or silent process kill inside `pd.merge` before the result is materialized.
## Why positional alignment is correct

`df_contrib` is constructed row-by-row from `df_base` in the same iteration, so the two DataFrames are always in the same row order. A `reset_index` + `pd.concat(..., axis=1)` is therefore semantically equivalent to the merge, without the non-unique-join hazard.

A size-mismatch guard falls back to the key-based merge with a `UserWarning`, preserving correctness in any unexpected edge case.
## Fix

- `_contrib_to_unit_values`: replace the unconditional `merge` with a positional `reset_index` + `concat`; fall back to the merge with a warning if row counts differ (see the sketch below).
- `_add_prediction_to_contribution`: same pattern.
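A minimal sketch of the pattern, not the PR's exact code; the function name, warning text, and `key` default are illustrative:

```python
import warnings
import pandas as pd

def attach_contributions(df_base: pd.DataFrame, df_contrib: pd.DataFrame,
                         key: str = "key") -> pd.DataFrame:
    """Attach contribution columns to df_base positionally.

    df_contrib was built row-by-row from df_base in the same iteration, so the
    two frames share row order even when df_base's index is non-sequential
    (e.g. after sort_values()). Resetting both indexes and concatenating
    column-wise is then equivalent to a left merge on the key, without the
    slow/non-unique merge path.
    """
    contrib_cols = [c for c in df_contrib.columns if c != key]

    if len(df_base) == len(df_contrib):
        out = pd.concat(
            [df_base.reset_index(drop=True),
             df_contrib[contrib_cols].reset_index(drop=True)],
            axis=1,
        )
        out.index = df_base.index        # restore the original index labels
        return out

    # Size-mismatch guard: fall back to the original key-based merge.
    warnings.warn("Row counts differ; falling back to key-based merge.", UserWarning)
    return df_base.merge(df_contrib, on=key, how="left")
```

Restoring `df_base`'s original index after the concat keeps label-based operations downstream unchanged, and the guard keeps the old merge path available for any unexpected row-count mismatch.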
## Reproducing

The crash is difficult to reproduce on a small synthetic dataset because small Arrow string columns typically fall through to pandas' fast unique-key path. It was observed on a real Philadelphia OPA dataset (~421k parcels). No minimal repro is included with this PR; if a test fixture is needed we can work on one together.